Example of stretching statement across multiple lines.
One long line
statement option1 option2 option3 option4;
versus several short lines.
statement
option1
option2
option3
option4;
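A concrete version of the same idea, as a minimal sketch: SAS ends a statement at the semicolon, not at the line break, so both layouts below should run identically. It uses sashelp.class, a sample dataset shipped with SAS; that dataset is an assumption for illustration and is not part of this module's data.

```sas
/* One long line */
proc print data=sashelp.class(obs=5); var name age height; run;

/* The same step stretched across several short lines */
proc print
    data=sashelp.class(obs=5);
  var
    name
    age
    height;
run;
```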
Rules for variable names (1/2)
Can use mix of
letters (A-Z, a-z),
numbers (0-9)
underscore (_)
no blanks, no symbols
Rules for variable names (2/2)
Can’t start with a digit
“a1” but not “1a”
Capitalization not important
BMI, Bmi, bmi are same
Up to 32 characters in length
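The rules above can be sketched in a short data step. The dataset name (name_rules) and the values are hypothetical, chosen only to illustrate which names SAS accepts.

```sas
/* Hypothetical names and values, for illustration only */
data name_rules;
  a1         = 1;   /* valid: starts with a letter              */
  fat_brozek = 3;   /* valid: letters and an underscore         */
  BMI        = 4;   /* valid: creates the variable BMI          */
  bmi        = 5;   /* same variable as BMI, so 4 becomes 5     */
  /* 1a = 6;           invalid: starts with a digit             */
  /* fat brozek = 7;   invalid: contains a blank                */
run;

proc print data=name_rules;
run;
```

The printed dataset should show three variables, not four, because BMI and bmi refer to the same column.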
Recommendations for variable names (1/2)
Avoid generic names (x1, var01, etc.)
Keep it short
Use commonly known abbreviations…
…but nothing cryptic
Use all lower case (age, not AGE or Age)
Recommendations for variable names (2/2)
Separate words with underscores
fat_brozek, not fatbrozek
Alternative: CamelCase
FatBrozek
Caution: Writer’s Exchange website
www.writersexchange.com
SAS variable labels (1/2)
Longer description of a variable
Can include blanks, special symbols
Internal documentation
Labels substituted on some (but not all) output
Required in this class (see grading rubric)
SAS variable labels (2/2)
Recommendations for variable labels
Judicious use of upper and lower case
Spell out abbreviations
Specify units of measurement
Any other important details
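A minimal sketch of the label statement, following the recommendations above. The label text and the dataset name body_labeled are assumptions; the units (inches, pounds) are inferred from the conversion factors used later in this module, so check the data dictionary before relying on them.

```sas
/* Label text is illustrative, not the official wording */
data body_labeled;
  set module02.body;
  label
    fat_brozek = "Percent body fat (Brozek's equation)"
    ht         = "Height (inches)"
    wt         = "Weight (pounds)"
    abdomen    = "Abdomen circumference at the umbilicus";
run;

/* The label option asks proc print to show labels as column headers */
proc print
    data=body_labeled(obs=5) label;
  var fat_brozek ht wt abdomen;
run;
```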
Documenting your program (1)
* 5507-02-simon-continuous-variables.sas
author: Steve Simon
date: created 2021-05-30
purpose: to work with continuous variables
license: public domain;
* datasets created in this program
body, original data
body1, row with ht=29.5 removed
body2, ht=29.5 converted to missing
body3, ht_cm calculated;
data module02.body;
infile rawdata;
input
case
fat_brozek
fat_siri
dens
age
wt
ht
bmi
ffw
neck
chest
abdomen
hip
thigh
knee
ankle
biceps
forearm
wrist;
* Some additional details about this data:
Brozek's equation is 457/Density - 414.2
Siri's equation is 495/Density - 450
Abdomen circumference is measured at the
umbilicus and level with the iliac crest
Wrist circumference is distal to the
styloid processes;
The footnote subcommand (6)
proc print
data=module02.body(obs=10);
var case fat_brozek fat_siri dens age;
title1 "Ten rows and five columns";
title2 "of the body data set";
footnote1 "Created by Steve Simon on &sysdate using SAS &sysver";
run;
Displaying metadata (7)
proc contents
data=module02.body;
title1 "Internal description of body dataset";
run;
Live demo, 1
data step
label subcommand
contents procedure
Break #1
What you have learned
Using variable labels
What’s coming next
Simple descriptive statistics
Simple descriptive statistics
Always look at these first
Is mean high, normal, or low?
Is data spread out or tight?
Zero standard deviation is a red flag
Are minimum and maximum reasonable?
Computing simple statistics (8)
proc means
n mean std min max
data=module02.body;
var ht;
title1 "Descriptive statistics for ht";
title2 "The mean is normal for adults";
title3 "The standard deviation shows tightly packed data";
title4 "The maximum value is reasonable";
title5 "The minimum is very low";
run;
Live demo, 2
means procedure
n option
mean option
std option
min option
max option
Break #2
What you have learned
Simple descriptive statistics
What’s coming next
Printing row with smallest/largest value
Sorting your data
Uses the sort procedure
Specify the dataset with data=
Specify the sorting variable with the by subcommand
Use descending keyword to sort in reverse order
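By default, the sort procedure replaces the input dataset with the sorted version. If you would rather keep the original row order, the out= option writes the sorted rows elsewhere. A minimal sketch, where body_sorted is a hypothetical dataset name:

```sas
/* Sort from tallest to shortest, leaving module02.body unchanged */
proc sort
    data=module02.body
    out=body_sorted;
  by descending ht;
run;

proc print
    data=body_sorted(obs=1);
  title1 "Row with the largest ht, taken from the sorted copy";
run;
```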
Printing row with smallest or largest value
Investigate other variables associated with outlier
Is the data shifted left or right?
Are other values consistent with the outlier?
Printing row with smallest value (9)
proc sort
data=module02.body;
by ht;
run;
proc print
data=module02.body(obs=1);
title1 "The row with the smallest ht";
title2 "Note the inconsistency with wt";
run;
Printing row with largest value (10)
proc sort
data=module02.body;
by descending ht;
run;
proc print
data=module02.body(obs=1);
title1 "The row with the largest ht";
title2 "This seems quite normal to me";
run;
Live demo, 3
sort procedure
by subcommand
descending option
Break #3
What you have learned
Printing row with smallest/largest value
What’s coming next
Missing value logic
What to do with outliers
Depends on the context, ask for help!
Live with it
Delete the entire observation
Convert the value to missing
How to handle outliers
No option is best in all cases
Live with them
Remove the entire row
Convert outlier to missing
Always report clearly
Importing missing values
Different coding schemes
dot (.) or blank ( ), the SAS standard
other symbols (*, ?)
NA, the R standard
NULL, the SQL standard
Extreme numbers (-1, 9, 99, 999)
Blank ( ) or empty ()
Advice on importing missing values
Read the data dictionary
Always ask WHY a value is missing
Convert any non-standard missing codes
if iq=999 then iq=.;
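A small end-to-end sketch of the conversion. The datasets (iq_raw, iq_clean), the values, and the use of 999 as the missing code are all hypothetical; in practice the code comes from the data dictionary.

```sas
/* Hypothetical data where 999 stands for a missing IQ score */
data iq_raw;
  input id iq;
  datalines;
1 102
2 999
3 95
;
run;

/* Convert the non-standard code to the SAS missing value */
data iq_clean;
  set iq_raw;
  if iq = 999 then iq = .;
run;

/* nmiss should now count one missing value, and the mean ignores it */
proc means
    n nmiss mean
    data=iq_clean;
  var iq;
run;
```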
Missing value logic in SAS
Treated in comparisons as smaller than any valid number
Behaves as if it were below \(-1.8 \times 10^{308}\), the most negative value SAS can store (on most computers)
Can identify with = . or missing()
Differs from R
Use caution with less than/greater than comparisons
age < 18 will include children AND missing ages
use age ^= . & age < 18 instead
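The comparison trap can be seen with three rows of made-up data. The dataset name (ages) and its values are hypothetical.

```sas
/* Hypothetical data: one child, one missing age, one adult */
data ages;
  input id age;
  datalines;
1 12
2 .
3 45
;
run;

proc print data=ages;
  where age < 18;
  title1 "Wrong: the row with the missing age is printed too";
run;

proc print data=ages;
  where age ^= . & age < 18;
  title1 "Better: only the row with age 12 is printed";
run;
```

Writing the second condition as not missing(age) and age < 18 should give the same result.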
Removing a row of data (11)
data module02.body1;
set module02.body;
if ht = 29.5 then delete;
run;
Converting outlier to missing (12)
data module02.body2;
set module02.body;
if ht=29.5 then ht=.;
run;
Printing negative values (wrong way) (13)
proc print
data=module02.body2;
where ht < 0;
title1 "Printing negative values for ht (wrong way)";
title2 "Use where ht ^= . & ht < 0 instead";
run;
Counting missing values (14)
proc means
n nmiss mean std min max
data=module02.body2;
var ht;
title1 "There is one missing value";
run;
Live demo, 4
set subcommand (data step)
if … then subcommand (data step)
where subcommand (proc)
nmiss option (means procedure)
Break #4
What you have learned
Missing value logic
What’s coming next
Simple transformations
Transforming values
Use data step to create a new variable
Unit conversion: temp_c = 5/9 * (temp_f - 32)
New variable: bmi = wt_kg / ht_m**2
Different variable, same data
data name1;
set name1;
wt_kg = wt / 2.2;
run;
Same variable, different data
data name2;
set name1;
wt = wt / 2.2;
run;
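The temperature conversion works the same way. This sketch takes the "new variable" approach so the original Fahrenheit values are kept; the dataset and variable names (temps, temp_f, temp_c) are hypothetical.

```sas
/* Hypothetical Fahrenheit readings */
data temps;
  input temp_f;
  temp_c = 5/9 * (temp_f - 32);   /* Fahrenheit to Celsius */
  datalines;
32
98.6
212
;
run;

proc print data=temps;
  title1 "Fahrenheit and Celsius side by side";
run;
```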
Transforming values (15)
data module02.body3;
set module02.body;
check_bmi = (wt / 2.2) / (ht / 39.37)**2;
check_ht = sqrt((wt / 2.2) / bmi) * 39.37;
check_wt = (bmi * (ht / 39.37)**2) * 2.2;
run;
proc print
data=module02.body3;
var ht check_ht wt check_wt bmi check_bmi;
where ht=29.5;
title1 "Recalculating ht, wt, and bmi";
title2 "Assuming two out of three are correct.";
run;
Reminder
\(bmi = \frac{wt}{ht^2}\), if wt and ht correct
\(ht = \sqrt{\frac{wt}{bmi}}\), if wt and bmi correct
\(wt = bmi \times ht^2\), if bmi and ht correct
Today’s technology, 3D printers (SAS does not support these)
Drawing histograms
Histograms can assess normality/non-normality
Skewness
Bimodal distributions
Outliers
How many bars? Multiple recommendations
Five to twenty bars
Square root of n bars
Trial and error
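The square root rule can be computed directly from the data. This is a rough sketch using proc sql, which is not otherwise covered in this module; the column names n and suggested_bars are made up for the example. Treat the result as a starting point for trial and error, not a final answer.

```sas
/* count(ht) counts only the non-missing heights */
proc sql;
  select count(ht)             as n,
         ceil(sqrt(count(ht))) as suggested_bars
  from module02.body2;
quit;
```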
Drawing a histogram (default) (16)
proc sgplot
data=module02.body2;
histogram ht;
title1 "Histogram shows a roughly normal distribution";
title2 "Default bins (not recommended)";
run;
Drawing a histogram (fewer bars) (17)
proc sgplot
data=module02.body2;
histogram ht / binstart=60 binwidth=5 nofill;
xaxis values=(60 to 85 by 5);
title "Histogram with wide bins (better)";
run;
Drawing a histogram (more bars) (18)
proc sgplot
data=module02.body2;
histogram ht / binstart=62 binwidth=1 nofill;
xaxis values=(62 to 80 by 2);
title "Histogram with narrow bins (best)";
run;
Live demo, 6
sgplot procedure
histogram subcommand
binstart option
binwidth option
Break #6
What you have learned
Histograms
What’s coming next
Correlations
Correlations
Informal interpretation
between +0.7 and +1.0: strong positive association
between +0.3 and +0.7: weak positive association
between -0.3 and +0.3: little or no association
between -0.3 and -0.7: weak negative association
between -0.7 and -1.0: strong negative association
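The informal scale can be written as a chain of if ... then ... else statements. This is only a sketch: the dataset name (corr_labels), the variable names, and the example values of r are hypothetical, and which category gets the exact boundary values (0.3 and 0.7) is an arbitrary choice. The missing() check comes first because a missing r would otherwise fall through to the last branch.

```sas
/* Hypothetical correlations, one per category */
data corr_labels;
  input r;
  length strength $ 28;
  if missing(r)      then strength = " ";
  else if r >=  0.7  then strength = "strong positive association";
  else if r >=  0.3  then strength = "weak positive association";
  else if r >  -0.3  then strength = "little or no association";
  else if r >  -0.7  then strength = "weak negative association";
  else                    strength = "strong negative association";
  datalines;
0.85
0.45
0.10
-0.45
-0.85
;
run;

proc print data=corr_labels;
  title1 "Informal interpretation of each correlation";
run;
```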
Computing correlations (default) (19)
proc corr
data=module02.body2
noprint
outp=correlations;
var fat_brozek fat_siri;
with neck -- wrist;
run;
Processing correlations (20)
data correlations;
set correlations;
if _type_ NE "CORR" then delete;
drop _type_;
fat_brozek=round(fat_brozek, 0.01);
fat_siri=round(fat_siri, 0.01);
run;
Sorting the correlations (21)
proc sort
data=correlations;
by descending fat_brozek;
run;
proc print
data=correlations;
title1 "Abdomen, hip, and chest show the strongest correlations";
run;
Live demo, 7
corr procedure
noprint option
outp option
with subcommand
drop subcommand (data step)
round function
Break #7
What you have learned
Correlations
What’s coming next
Scatterplots
Drawing a scatterplot (22)
proc sgplot
data=module02.body2;
scatter x=abdomen y=fat_brozek /
markerattrs=(size=10 symbol=circle);
pbspline x=abdomen y=fat_brozek /
lineattrs=(pattern=dash color=red);
title1 "Simple scatterplot shows a strong positive trend";
title2 "It levels off for high values.";
title3 "This may be due solely to a single outlier on the high end";
run;